Comparative analysis of radiomics and deep-learning algorithms for survival prediction in hepatocellular carcinoma

To examine the comparative robustness of computed tomography (CT)-based conventional radiomics and deep-learning convolutional neural networks (CNN) to predict overall survival (OS) in HCC patients. Retrospectively, 114 HCC patients with pretherapeutic CT of the liver were randomized into a development (n = 85) and a validation (n = 29) cohort, including patients of all tumor stages and several applied therapies. In addition to clinical parameters, image annotations of the liver parenchyma and of tumor findings on CT were available. Cox-regression based on radiomics features and CNN models were established and combined with clinical parameters to predict OS. Model performance was assessed using the concordance index (C-index). Log-rank tests were used to test model-based patient stratification into high/low-risk groups. The clinical Cox-regression model achieved the best validation performance for OS (C-index [95% confidence interval (CI)] 0.74 [0.57–0.86]) with a significant difference between the risk groups (p = 0.03). In image analysis, the CNN models (lowest C-index [CI] 0.63 [0.39–0.83]; highest C-index [CI] 0.71 [0.49–0.88]) were superior to the corresponding radiomics models (lowest C-index [CI] 0.51 [0.30–0.73]; highest C-index [CI] 0.66 [0.48–0.79]). A significant risk stratification was not possible (p > 0.05). Under clinical conditions, CNN-algorithms demonstrate superior prognostic potential to predict OS in HCC patients compared to conventional radiomics approaches and could therefore provide important information in the clinical setting, especially when clinical data is limited.


Study population
A total of 343 patients with an initial diagnosis of HCC were discussed by the tumor board of our University Hospital between January 2010 and October 2021. Subsequently, patients were selected according to the following inclusion criteria: (1) HCC patients who received a contrast-enhanced CT scan of the liver (consisting of at least an arterial and a venous contrast phase) before therapy initiation; (2) the diagnosis of HCC had to be confirmed by a second imaging modality (e.g. ultrasound or magnetic resonance imaging) showing typical HCC changes, or by histopathological findings, according to the German HCC guideline [9]; (3) initial therapy and at least one follow-up imaging examination were carried out at our hospital.
The exclusion criteria were: (1) incomplete CT scans (missing arterial or venous contrast phase); (2) CT scans with severe artifacts; (3) patients with another active tumor disease (defined as tumor diagnosis or therapy within 2 years prior to inclusion in the present study).
Based on these criteria (Fig. 1), a total of 114 patients were retrospectively enrolled and divided into a development (n = 85) and a validation (n = 29) cohort by stratified randomization, with stratification performed on the initial therapy concept. Overall, the diagnosis of HCC was confirmed histopathologically in 60/114 patients, with the remaining 54/114 HCCs confirmed by imaging patterns.
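A stratified split of this kind can be sketched as follows. This is a minimal illustration and not the authors' procedure: the function name `stratified_split`, the validation fraction, the random seed and the toy therapy labels are all assumptions.

```python
import random
from collections import defaultdict

def stratified_split(patient_ids, strata, validation_fraction=0.25, seed=0):
    """Split patients into development/validation sets within each stratum."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for pid, stratum in zip(patient_ids, strata):
        by_stratum[stratum].append(pid)

    development, validation = [], []
    for stratum in sorted(by_stratum):
        members = by_stratum[stratum]
        rng.shuffle(members)
        # The same fraction is drawn from every stratum (here: therapy concept).
        n_val = round(len(members) * validation_fraction)
        validation.extend(members[:n_val])
        development.extend(members[n_val:])
    return development, validation

# Toy example with hypothetical therapy labels (not study data).
ids = list(range(12))
therapy = ["TACE"] * 8 + ["RES"] * 4
dev, val = stratified_split(ids, therapy)
```

Drawing the same fraction from each therapy group keeps the distribution of initial therapies similar in both cohorts, which is the purpose of stratification.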

Clinical variables and radiological characteristics
Demographic data and routine lab tests were obtained for all patients. These included age, gender, time to death or follow-up time, as well as the serological parameters alpha-fetoprotein (AFP), alanine-aminotransferase (ALAT), aspartate-aminotransferase (ASAT), albumin, total bilirubin, creatinine, gamma-glutamyltransferase (GGT) and International Normalized Ratio (INR).
Tumor characteristics (number of lesions, presence of metastases, volume and density values of the largest HCC lesion), imaging features (status of liver cirrhosis and ascites) and initial therapy concepts [surgical resection (RES), radiofrequency ablation (RFA), liver transplantation (TRANS), transarterial chemoembolization (TACE), radiotherapy (RT), systemic therapy (ST) and best supportive care (BSC)] were also recorded. The ALBI, Child-Pugh and MELD scores and the BCLC stage were calculated using the respective established formulae and flow charts [3, 10–13]. The Child-Pugh score was only evaluated in patients with suspected cirrhosis. Status of liver cirrhosis (present/absent) was assessed on CT by two residents (2 and 3 years of experience in liver imaging) analogously to Nebelung et al. [14] using the following criteria: hypertrophy of the liver segments I/II/III with concomitant atrophy of the segments VI/VII, surface and parenchymal nodularity of the liver, heterogeneous density values, portal vein enlargement, and ascites. Ascites was classified as absent, mild or moderate. In case of disagreement, the final decision was made by a senior radiologist with more than 15 years of experience.

Overall survival
Overall survival was calculated as the period from the initial CT scan to the time of either death or last contact with our hospital (e.g. follow-up examination or discharge from inpatient stay).

Imaging protocol and annotation
The CT scans were acquired on a total of 24 different scanners at 21 medical centers. Seventy-eight patients (68%) received their initial CT at our hospital. External CT scans of the remaining 36 patients (32%) were transmitted to our institution as part of routine clinical practice. Common contrast timing protocols (for arterial contrast: bolus tracking or approximately 25 to 35 s after contrast agent injection; for venous contrast: approximately 60 to 70 s after contrast agent injection) were applied for image acquisition. See Supplementary Table S1 for the variability of further scan parameters.
A resident with two years of experience in liver imaging contoured the liver parenchyma and the largest HCC lesion in both contrast phases using the open-source software 3D Slicer (http://www.slicer.org) [15]. All segmentations were verified by the same resident 4 weeks later. An example of a segmentation is shown in Supplementary Fig. S1.

Standardization of the CT datasets
Variations in the circulatory capacity of patients, contrast medium injection parameters, and imaging time contribute to interindividual differences in the timing of contrast agent inflow [16]. The resulting differences across patients may influence the radiomics data derived from these scans [17]. To address this, a self-developed standardization procedure was performed.
For each patient $i$, the mean CT number ($\mathrm{CTN}_{\mathrm{mean},i}$) in a circular segmentation within the aorta at the level of the coeliac trunk was recorded in the arterial and venous phase. The cohort mean for each phase ($\mathrm{CTN}_{\mathrm{mean,cohort}}$) was used to scale the CT numbers of each patient using the formula $\mathrm{CTN}_{\mathrm{new},i} = \mathrm{CTN}_{\mathrm{old},i} \times \frac{\mathrm{CTN}_{\mathrm{mean,cohort}}}{\mathrm{CTN}_{\mathrm{mean},i}}$. To compensate for different slice thicknesses, all CT images were interpolated to an isotropic voxel size of 1.0 mm³. An anti-aliasing filter was applied, and contours were re-segmented to density values between −200 and 500 Hounsfield units (HU). For details, see Supplementary Table S2.
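The scaling step can be illustrated with a few lines of Python. This is a sketch under stated assumptions: the function name `scale_ctn`, the patient identifiers and all numeric values are hypothetical, and the images are represented as flat lists of CT numbers for brevity.

```python
from statistics import mean

def scale_ctn(voxel_values, aorta_mean_patient, aorta_mean_cohort):
    """Scale CT numbers so the patient's aortic mean matches the cohort mean."""
    factor = aorta_mean_cohort / aorta_mean_patient
    return [v * factor for v in voxel_values]

# Hypothetical aortic mean CT numbers (HU) of three patients in one contrast phase.
aorta_means = {"p1": 300.0, "p2": 250.0, "p3": 350.0}
cohort_mean = mean(list(aorta_means.values()))

# Patient p2 enhanced less than the cohort average, so the values are scaled up.
scaled_p2 = scale_ctn([100.0, 125.0, 250.0], aorta_means["p2"], cohort_mean)
```

After scaling, the aortic mean of every patient equals the cohort mean in that phase, so differences in contrast timing contribute less to the extracted features.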

Conventional radiomics risk modelling
Radiomics features were extracted from the segmentations of the liver and HCC in the arterial (_art) and venous (_ven) phase. The extraction was implemented according to the recommendations of the Image Biomarker Standardization Initiative (IBSI) using the publicly available open-source Medical Image Radiomics Processor (MIRP) [18, 19]. Feature values obtained from the venous phase images were subtracted from the corresponding arterial phase values to quantify differences between both phases (_diff). In summary, six feature subgroups were extracted (HCC_art, HCC_ven, HCC_diff, liver_art, liver_ven, liver_diff), resulting in 1146 imaging features per patient.
To develop conventional radiomics models, the "Fully Automated Machine Learning with Interpretable Analysis of Results" (FAMILIAR, version 1.2.0) framework (https://github.com/alexzwanenburg/familiar) was used [20]. The settings used for feature extraction and model building can be found in Supplementary Table S3. Three primary models were constructed to predict OS, consisting of a clinical model, an image-based radiomics model and a combined model of clinical and imaging data. Four supplementary models analyzing the imaging segmentations separately were additionally created to compare the predictive power for OS in the different contrast phases and imaging components (whole liver parenchyma vs. HCC).
For each model, feature importance was evaluated using a 15-times repeated threefold cross-validation scheme, resulting in 45 internal models in total. In each iteration, multiple feature processing steps were applied: missing value imputation, feature transformation, filtering and clustering. The overall importance of a feature was assessed by its occurrence within the five highest-ranked features across all 45 internal models. The signature size was set to the median signature size of all 45 internal models. The features with the highest importance were used to create a Cox proportional hazards model for the prediction of OS. Subsequently, the models were validated on the validation cohort. Details of feature processing and model development are given in Supplementary Table S3.
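The occurrence-based selection of the final signature can be sketched as below. This is an illustration only and does not reproduce the FAMILIAR implementation: the function name `select_signature`, the alphabetical tie-breaking rule and the toy rankings are assumptions.

```python
from collections import Counter
from statistics import median

def select_signature(rankings, signature_sizes, top_k=5):
    """Pick features by how often they appear in the top_k of the internal models.

    rankings: one ranked list of feature names per internal model (best first).
    signature_sizes: signature size chosen by each internal model.
    """
    counts = Counter()
    for ranked in rankings:
        counts.update(ranked[:top_k])
    size = int(median(signature_sizes))
    # Sort by occurrence (descending); ties are broken alphabetically (assumption).
    ordered = sorted(counts, key=lambda name: (-counts[name], name))
    return ordered[:size]

# Toy example with three internal models instead of 45 and top_k=2 instead of 5.
toy_rankings = [
    ["GGT", "AFP", "volume"],
    ["AFP", "GGT", "volume"],
    ["GGT", "volume", "AFP"],
]
signature = select_signature(toy_rankings, signature_sizes=[2, 3, 2], top_k=2)
```

In the toy data, GGT appears in the top two of all three models and AFP in two of them, so both enter the signature of median size two.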

Deep-learning-based risk modelling
All segmentations of the liver and the HCC were considered. To accommodate the large range of sizes observed for liver and HCC lesions across the image datasets, a cropping procedure was applied: all images were cropped to the 95th percentile of the distribution of liver or HCC sizes in each direction. In addition, all images were resampled to a voxel size of 2 × 2 × 2 mm³. The resulting image dimensions were 64 × 64 × 64 and 132 × 144 × 132 voxels for the HCC and the liver segmentations, respectively. The voxel intensities were rescaled to the interval [0, 1]. To avoid overfitting on characteristics outside of the ROI, these regions were masked by setting all voxel intensities outside the ROI to zero.
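The intensity rescaling and masking can be sketched as follows. The text does not state how the intensities were mapped to [0, 1]; clipping to a fixed HU window before linear scaling is an assumption made here for illustration, as are the function name `rescale_and_mask` and the window limits. Images are represented as flat lists for brevity.

```python
def rescale_and_mask(intensities, roi_mask, lower=-200.0, upper=500.0):
    """Clip to [lower, upper] HU, map linearly to [0, 1], zero voxels outside the ROI.

    intensities: flat list of CT numbers (HU).
    roi_mask: flat list of 0/1 flags, 1 meaning the voxel lies inside the ROI.
    The clipping window is an assumption, not a value confirmed for this step.
    """
    span = upper - lower
    result = []
    for value, inside in zip(intensities, roi_mask):
        clipped = min(max(value, lower), upper)
        scaled = (clipped - lower) / span
        result.append(scaled if inside else 0.0)
    return result

# Toy voxels: the fourth voxel lies outside the ROI and is set to zero.
masked = rescale_and_mask([-300.0, -200.0, 150.0, 500.0, 600.0], [1, 1, 1, 0, 1])
```

Setting voxels outside the ROI to zero means the network cannot pick up on anatomy surrounding the liver or lesion.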
For clinical data, missing values were imputed using the median value over all patients. If the percentage of missing features for one patient exceeded 30%, the patient was excluded. All clinical data were converted to a numerical scale. The features were transformed using Yeo-Johnson normalization and Z-standardization and mapped linearly to the interval 0 to 1 based on the development cohort. Transformation parameters were applied to the validation cohort unchanged. Four primary deep-learning-based models were constructed to predict OS, consisting of a clinical model, two image-based models (based on the HCC and liver segmentations) and a combined model of clinical and HCC imaging data. As deep-learning algorithms require significantly more computing power, it was not possible to create an imaging model consisting of all CT data as in the conventional radiomics approach. Four supplementary models analyzing the imaging segmentations separately were additionally created to compare the predictive power for OS in the different contrast phases and imaging components (whole liver parenchyma vs. HCC).
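The principle of fitting the preprocessing on the development cohort and applying it unchanged to the validation cohort can be sketched as follows. For brevity, the sketch covers only median imputation and the linear mapping to [0, 1]; the Yeo-Johnson and Z-standardization steps are omitted. Taking the imputation median from the development cohort is an assumption, and the names `fit_scaler` and `transform` are hypothetical.

```python
from statistics import median

def fit_scaler(dev_values):
    """Learn imputation and scaling parameters from the development cohort only."""
    observed = [v for v in dev_values if v is not None]
    return {"median": median(observed), "min": min(observed), "max": max(observed)}

def transform(values, params):
    """Impute missing values (None) and map linearly using the stored parameters."""
    span = params["max"] - params["min"]
    out = []
    for v in values:
        v = params["median"] if v is None else v
        out.append((v - params["min"]) / span if span else 0.0)
    return out

# Toy values of one clinical feature; None marks a missing measurement.
params = fit_scaler([1.0, None, 3.0, 5.0])
dev_scaled = transform([1.0, None, 3.0, 5.0], params)
val_scaled = transform([None, 7.0], params)  # parameters reused unchanged
```

Because the parameters come from the development cohort alone, validation values can fall outside [0, 1], as the second validation value does here.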
All models were implemented using the Python-based deep-learning library PyTorch [21]. The general architecture of the networks was designed after Hosny et al., Starke et al., and Nie et al. and is illustrated in Supplementary Fig. S2 [22–24]. For example, the proposed image-based model consists of four convolutional layers and three fully connected layers. To regularize the model, batch normalization and dropout layers are incorporated. CT images of both arterial and venous phases form the input of the model. They are processed through the convolutional layers before being concatenated and further processed by the fully connected layers. Following Katzman et al., the loss function is set to the negative log of the Cox partial likelihood with regularization [25]. The final output is therefore a single value representing the predicted hazard of the model. Details regarding the hyperparameters used can be found in Supplementary Table S4.
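The loss function can be written out in a few lines. The following is a plain-Python sketch for illustration rather than the PyTorch code used in the study: it omits the regularization terms, does not correct for tied event times, and averages over the number of events, all of which are assumptions. The function name `neg_log_partial_likelihood` is hypothetical.

```python
import math

def neg_log_partial_likelihood(log_hazards, times, events):
    """Negative log Cox partial likelihood, averaged over observed events.

    log_hazards: model output per patient (higher means higher predicted risk).
    times: observed survival or follow-up times.
    events: 1 if death was observed, 0 if the patient was censored.
    """
    n_events = sum(events)
    if n_events == 0:
        return 0.0
    total = 0.0
    for i, (t_i, e_i) in enumerate(zip(times, events)):
        if not e_i:
            continue  # censored patients only contribute through risk sets
        # Risk set: all patients still under observation at time t_i.
        risk_set = [h for h, t in zip(log_hazards, times) if t >= t_i]
        total += log_hazards[i] - math.log(sum(math.exp(h) for h in risk_set))
    return -total / n_events

times = [1.0, 2.0, 3.0]
events = [1, 1, 0]
loss_good = neg_log_partial_likelihood([2.0, 1.0, 0.0], times, events)
loss_bad = neg_log_partial_likelihood([0.0, 1.0, 2.0], times, events)
```

The loss is lower when patients who die earlier receive higher predicted hazards, which is why `loss_good` is smaller than `loss_bad` in the toy example.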
The number of training epochs was determined through a 15-times repeated threefold cross-validation, resulting in 45 internal models in total. Each model was trained for 500 epochs on the training fold and monitored for testing fold performance after every epoch. Model performance was assessed by the average performance of the last five epochs to reduce statistical noise. Finally, for validation, 45 models were trained on the entire development cohort using the number of epochs with the highest cross-validation performance. The final prediction for a patient was established by taking the average prediction of all 45 models.
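One possible reading of this epoch selection and ensembling scheme is sketched below. The smoothing is interpreted here as a trailing average over the most recent epochs, which is an assumption, as are the function names and the toy performance curves.

```python
from statistics import mean

def smoothed(curve, window=5):
    """Average each epoch's test-fold score with the preceding epochs (trailing window)."""
    return [mean(curve[max(0, e - window + 1): e + 1]) for e in range(len(curve))]

def best_epoch(cv_curves, window=5):
    """Return the 1-based epoch count with the highest mean smoothed score.

    cv_curves: one list of per-epoch test-fold C-indices per internal model.
    """
    smoothed_curves = [smoothed(curve, window) for curve in cv_curves]
    n_epochs = len(cv_curves[0])
    mean_curve = [mean([c[e] for c in smoothed_curves]) for e in range(n_epochs)]
    return max(range(n_epochs), key=mean_curve.__getitem__) + 1

def ensemble_prediction(predictions):
    """Average the predicted hazards of all ensemble members for one patient."""
    return mean(predictions)

# Toy curves for two internal models over four epochs (the study used 45 and 500).
toy_curves = [[0.5, 0.6, 0.7, 0.6], [0.5, 0.7, 0.7, 0.5]]
chosen = best_epoch(toy_curves, window=2)
```

Smoothing over several epochs avoids selecting an epoch because of a single noisy peak, and averaging the ensemble members reduces the variance of the final prediction.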

Evaluation of prognostic performance
Prognostic performance was evaluated by the concordance index (C-index) and the ability to stratify patients into risk groups based on the model predictions. The C-index measures the agreement between the actual OS and the model predictions. A C-index of 0.5 indicates no prognostic value, while a value close to 1 indicates perfect prediction. Patients were allocated into a low- or high-risk group for death based on the hazard values predicted by the models. The median value of these predictions was used as a cutoff on the development cohort. Patients with a predicted hazard exceeding the cutoff were assigned to the high-risk group. The difference between the low- and high-risk group was assessed using the log-rank test. The significance level was set to α = 0.05. The confidence intervals (CI) for the internal cross-validation were calculated by analyzing the distribution of the 45 model performances. To estimate the CIs for the validation, the percentile bootstrap method was performed [26]. To compare the prognostic performance of two models, a two-sample bootstrap test was employed: the difference in C-indices was computed for 1000 bootstrap samples of the validation cohort. The smaller proportion of bootstrap samples in which the C-index difference was either greater than 0 or less than 0 was multiplied by 2 to obtain the p-value.
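The evaluation steps described above can be sketched in plain Python. This is an illustrative implementation under stated assumptions, not the code used in the study: the C-index follows Harrell's definition without special handling of tied times, resamples without comparable pairs are skipped, and all function names and toy data are hypothetical.

```python
import random
from statistics import median

def c_index(risks, times, events):
    """Harrell's C: share of comparable pairs where the higher risk dies earlier."""
    concordant, comparable = 0.0, 0
    for i in range(len(times)):
        if not events[i]:
            continue  # a pair is comparable only if the earlier time is an event
        for j in range(len(times)):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else float("nan")

def assign_risk_groups(dev_risks, risks):
    """Use the median development-cohort prediction as the cutoff."""
    cutoff = median(dev_risks)
    return ["high" if r > cutoff else "low" for r in risks]

def bootstrap_p_value(risks_a, risks_b, times, events, n_boot=1000, seed=0):
    """Two-sided bootstrap test for the difference in C-index of two models."""
    rng = random.Random(seed)
    n = len(times)
    greater = less = valid = 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        t = [times[k] for k in idx]
        e = [events[k] for k in idx]
        diff = (c_index([risks_a[k] for k in idx], t, e)
                - c_index([risks_b[k] for k in idx], t, e))
        if diff != diff:  # NaN: the resample had no comparable pairs
            continue
        valid += 1
        greater += diff > 0
        less += diff < 0
    if valid == 0:
        return float("nan")
    return min(1.0, 2 * min(greater, less) / valid)

# Toy validation data: model A ranks patients perfectly, model B in reverse.
times = [1, 2, 3, 4, 5, 6]
events = [1, 1, 0, 1, 1, 0]
risk_a = [6, 5, 4, 3, 2, 1]
risk_b = [1, 2, 3, 4, 5, 6]
p_value = bootstrap_p_value(risk_a, risk_b, times, events, n_boot=200)
```

Note that the risk-group cutoff is derived from the development cohort and then applied unchanged to the validation cohort, so the validation groups need not be of equal size.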

Study population
Development and validation cohort were balanced in terms of clinical parameters and baseline demographics (p > 0.05; Table 1).

Conventional radiomics approach
Three primary models were developed: a clinical model, a radiomics model including all imaging features and a model combining clinical and imaging data. In addition, four supplementary radiomics models (HCC_art; HCC_ven; liver_art; liver_ven) were developed based on the individual image segmentations.
The median signature sizes were three, six and seven for the clinical, image-based, and combined model, respectively, and ranged between two and five for the supplementary models. For the clinical model, six patients (four in the development cohort and two in the validation cohort) were excluded from the analysis due to missing values > 30%. The final Cox-regression models are reported in Table 2 for the primary analyses and in Supplementary Table S5 for the supplementary analyses. The results of the internal cross-validation are shown in Supplementary Table S6.

Comparison of the conventional radiomics and the deep-learning approach
The deep-learning approach outperformed the conventional radiomics approach, with a significant improvement for the liver_ven model (p = 0.032). Figure 3 highlights the differences in the C-indices in the validation cohort between the conventional radiomics and deep-learning models. Supplementary Fig. S3 shows the calibration plots of the best-performing image-based models from both the conventional radiomics and deep-learning approaches.

Discussion
In the present study, we investigated whether conventional radiomics and deep-learning algorithms can predict OS in HCC patients based on CT data regardless of tumor stage or applied therapy and compared both methods for superiority. Overall, deep-learning algorithms outperformed conventional radiomics features and could help to predict OS. Still, the clinical Cox-regression model showed the best performance in the presented setting.
To the best of our knowledge, our study is the first radiomics analysis of CT scans for OS of HCC patients across all tumor stages and common therapies. Previous studies have focused on specific therapies or tumor stages. In addition, analyses were often based on only one contrast phase and rarely used the combination of HCC and liver parenchyma.
To date, the predictive power of deep-learning algorithms on CT images for predicting OS of HCC patients has not been comprehensively evaluated. Wang et al. reported a C-index of 0.58 for patients undergoing stereotactic radiotherapy [27]. Better results were observed in patients who received TACE alone (C-indices = 0.65 and 0.73) or a combination of TACE and sorafenib (C-index = 0.72) [28–30]. The C-indices of 0.63–0.71 obtained in the present study are in line with the listed values and thus show the potential for outcome prediction even in HCC patients receiving different therapeutic approaches, although no significant risk stratification was possible.
Conventional radiomics models for predicting OS of HCC patients have been evaluated more commonly so far. C-indices in the literature range from 0.63 to 0.78 and from 0.60 to 0.67 for HCC patients undergoing surgery or TACE, respectively [31–35]. However, validation on holdout or external datasets was not always performed and risk stratification was not always possible. Here, the majority of our conventional radiomics models showed a performance close to random prediction. With a C-index of 0.66, only the best model (HCC_art) showed a value comparable to the literature. Each deep-learning image-based model outperformed its conventional radiomics counterpart, with statistical significance for the venous liver model (liver_ven), leading us to the conclusion that deep learning may offer enhanced prognostic utility. The main reason for the limited performance of the conventional radiomics approach may be the heterogeneous study cohort. Previous studies reported a lack of reproducibility of handcrafted radiomics features between different CT scanners, acquisition and reconstruction parameters [8, 36–39]. As we used CT data from 24 different scanners, acquisition parameters were heterogeneous, which may have negatively affected the reproducibility of radiomics features. In contrast, features extracted by deep learning may be more robust [40]. This observation aligns with findings from a comparative study on head and neck cancer OS prediction, which demonstrated that deep-learning models exhibit superior generalizability across different institutions compared to conventional radiomics approaches [41].
Overall, the clinical model based on Cox-regression was superior to all imaging approaches, with significantly different OS between the stratified risk groups, suggesting a high importance of clinical factors for generalized prediction models. The parameters of the final signature, consisting of GGT, AFP and the HCC volume, have a known impact on the prognosis of affected patients. Elevated GGT levels may indicate liver damage, such as chronic hepatic parenchymal remodeling or HCC [42], and are associated with OS in HCC [43, 44]. AFP is the most common serum marker in HCC, with higher AFP levels associated with poorer OS [45, 46]. HCC volume is associated with tumor malignancy and infiltrative behavior [47]. Wu et al. point out that tumor size at diagnosis is an independent prognostic factor for OS, irrespective of tumor grade, stage, or treatment selected [48]. Our results support these findings. However, other parameters, such as the MELD score, were not included in the clinical model. Although this factor has been identified as a predictor of HCC prognosis [49], within the scope of our multi-step machine learning workflow and the heterogeneity of our patient cohort, the inclusion of additional clinical parameters did not benefit the predictive performance and its generalizability. As a lack of clinical data remains a non-negligible problem in patient care in some cases, the development of image-based deep-learning and conventional radiomics models is essential. Further studies in larger patient groups are therefore needed to explore the comparative potential of image-based algorithms and clinical models.
Interestingly, the clinical deep-learning model was outperformed by the clinical Cox-regression model, although the difference was not statistically significant. One potential explanation for this finding is that the clinical deep-learning model may not have been fully optimized. Specifically, overfitting on the development cohort was observed, which suggests that further refinement and optimization of the model hyperparameters may lead to better performance on future datasets.
The complexity of deep-learning models raises questions about their value for OS prediction. To increase this value, the models should be interpretable and easy for physicians to understand and rely on. To improve the interpretability of the deep-learning models, their output was correlated with the conventional radiomics features. For the HCC model, a high correlation with the HCC volume was observed (Spearman R = 0.94), indicating that an increase in HCC volume corresponds to lower OS. This finding is consistent with the results of the clinical Cox-regression model of this study. Similarly, predictions of the liver model were associated with liver volume (R = 0.86), suggesting that an increase in liver volume corresponds to lower OS. This finding contradicts the expectation that progressing cirrhosis is associated with a decreasing liver volume leading to lower OS [50]. Future research should aim to better understand the complex patterns that deep-learning algorithms can detect.
There are some limitations of this study. First, it was a retrospective study with a small sample size of 114 patients and limited follow-up duration. Especially for CNNs, the limited sample size increases the risk of overfitting and reduces the reliability of the results. However, multiple strategies were employed to minimize the risk of overfitting despite the limited sample size: (i) early stopping of the training process using cross-validation; (ii) masking the CT image to the ROI to prevent overfitting on surrounding anatomical structures; (iii) architectural considerations like batch normalization, the use of dropout layers and pooling layers and data augmentation; (iv) ensemble prediction by averaging the output from 45 individually trained models for the final prediction. The small resulting differences in CNN performance between the development and validation cohort suggest that overfitting is unlikely, despite the limited sample size. Second, the study population is highly heterogeneous in terms of various factors such as applied treatment, CT acquisition protocols, and HCC tumor stages. While this heterogeneous group reflects everyday clinical practice, it may also limit the generalizability of the findings, as the specific distribution of clinical characteristics and treatment approaches may differ significantly across clinics. In addition, 1/85 and 1/29 patients in the development and validation cohort, respectively, were expected to die from treatment-related causes. Because so few patients were affected, the impact on the developed models can be considered negligible. Third, not all clinical parameters were available and only the largest lesion was segmented in multifocal HCC. Whole tumor burden analysis may improve the efficiency of OS prediction, although all HCC lesions

Figure 2. Kaplan-Meier survival curves of patients stratified into risk groups (cutoff value = 1.024 years) by the clinical model in the development and validation cohort. Differences in OS between low- and high-risk groups were evaluated by the log-rank test.

Figure 3. Comparison of C-indices between the conventional radiomics and the deep-learning models in the validation cohort. Positive values indicate better performance of the conventional approach, whereas negative values indicate better performance of the deep-learning approach. The whiskers represent the 95% confidence interval. The horizontal line within the distributions illustrates the median value. For the comparison of the HCC and liver models, the primary radiomics model was used. *Statistically significant (p < 0.05).

Table 1. Patient characteristics of the development and validation cohort. The variables describing the CTN and volume refer to the largest HCC lesion. P-values were obtained by using Chi-square homogeneity tests and two-sided Mann-Whitney U tests for categorical and numerical variables, respectively. AFP alpha-fetoprotein, ALAT alanine-aminotransferase, ASAT aspartate-aminotransferase, BCLC Barcelona clinic liver cancer, BSC best supportive care, CTN computed tomography number, GGT gamma-glutamyltransferase, HCC hepatocellular carcinoma, INR international normalized ratio, RES surgical resection, RFA radiofrequency ablation, RT radiotherapy, ST systemic therapy, TACE transarterial chemoembolization, TRANS liver transplantation.

Table 2. Final signatures of the primary clinical, image-based, and combined multivariate Cox-regression models and their respective parameters. The hazard ratio (HR) [95% CI] and the corresponding p-values of the regression are shown based on the development cohort. AFP alpha-fetoprotein, art arterial phase, diff difference, CI confidence interval, GGT gamma-glutamyltransferase, HCC hepatocellular carcinoma, ven venous phase. *Statistically significant (p < 0.05).

Table 3. Final performance of the Cox-regression models in development and independent validation: C-indices [95% CI] and p-values for risk stratification. art arterial phase, CI confidence interval, HCC hepatocellular carcinoma, ven venous phase. *Statistically significant (p < 0.05).

Table 4. Final performance of the deep-learning-based models in development and independent validation: C-indices [95% CI] and p-values for risk stratification. art arterial phase, CI confidence interval, HCC hepatocellular carcinoma, ven venous phase. *Statistically significant (p < 0.05).